BITCOIN


Bitcoin, introduced in 2009 by the mysterious Satoshi Nakamoto, stands as the pioneering cryptocurrency. It operates on a decentralized peer-to-peer network, allowing secure and transparent transactions without the need for a central authority. Bitcoin’s blockchain technology records every transaction in a public ledger, ensuring immutability and trustworthiness through cryptographic hashing.

Our project is structured into two key sections:

1. Data Exploration
  • Analyzing and visualizing Bitcoin price data.
  • Identifying trends and patterns in historical price movements.

2. Time Series Analysis and Forecasting, with the following methods:
  • LSTM (Long Short-Term Memory): Leveraging deep learning to capture complex temporal dependencies.
  • XGBoost: Harnessing gradient boosting to predict Bitcoin price fluctuations.
  • Facebook Prophet: Using a versatile tool for forecasting with seasonality and holidays.
  • ARIMA (AutoRegressive Integrated Moving Average): Employing a traditional statistical approach for time series forecasting.

This project draws inspiration from notable kernels and tutorials, including those focused on Prophet, XGBoost, and ARIMA models, tailored specifically for Bitcoin price prediction.

For those new to Bitcoin, you can learn more about its fundamentals here.


Checking out the datasets in my data folder

From the code below we set a default data folder, list the files it contains, and print them to the console. In our case there are two files:

The first file holds minute-level market prices for BTC from 1 January 2012 to 11 November 2018 (Bitstamp).

The second file holds minute-level market prices for BTC from 1 December 2014 to 11 November 2018 (Coinbase).

# Set the directory path where the data is located
data_dir <- "~/Git/Data"

# List files in the directory specified by data_dir
dir_contents <- list.files(data_dir)

# Print the contents of the directory to the console
print(dir_contents)
## [1] "bitstampUSD_1-min_data_2012-01-01_to_2018-11-11.csv"
## [2] "coinbaseUSD_1-min_data_2014-12-01_to_2018-11-11.csv"

Environment preparation

Getting packages

# Define the list of package names
packages <- c("tidyverse", "plotly", "lubridate", "readr", "dplyr", "zoo", "ggplot2", "forecast", "stats", "tseries")
# Use lapply to load each package
lapply(packages, require, character.only = TRUE)
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] TRUE
## 
## [[6]]
## [1] TRUE
## 
## [[7]]
## [1] TRUE
## 
## [[8]]
## [1] TRUE
## 
## [[9]]
## [1] TRUE
## 
## [[10]]
## [1] TRUE
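If some of these packages are not yet installed, require() will simply return FALSE for them. A small optional guard (a sketch, not part of the original analysis; it assumes a reachable CRAN mirror) installs whatever is missing first, and invisible() hides the printed list of TRUEs:

```r
# Same package list as above
packages <- c("tidyverse", "plotly", "lubridate", "readr", "dplyr",
              "zoo", "ggplot2", "forecast", "stats", "tseries")

# Install any packages that are not already present
missing <- setdiff(packages, rownames(installed.packages()))
if (length(missing) > 0) {
  install.packages(missing)
}

# Load them all, suppressing the list of TRUE values
invisible(lapply(packages, require, character.only = TRUE))
```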

Data importation

  • The data importation uses the read_csv function from the readr package. Our data has a Timestamp column which is converted to readable form with the code line mutate(Timestamp = as.POSIXct(Timestamp, origin = "1970-01-01", tz = "UTC"))

  • This line interprets the numeric Timestamp, counted in seconds, as a human-readable date-time. We set the timezone to UTC and pass the Unix epoch, 1970-01-01, as the origin from which the seconds are counted, and set the path to our file.

  • We end the code by displaying the first 6 rows of data.
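To see what the origin argument does, note that a raw value of 0 is exactly the Unix epoch and every unit adds one second, so 86400 seconds lands one day later:

```r
# Unix timestamps count seconds since 1970-01-01 00:00:00 UTC (the origin)
t0 <- as.POSIXct(0, origin = "1970-01-01", tz = "UTC")
t1 <- as.POSIXct(86400, origin = "1970-01-01", tz = "UTC")  # 86400 s = 1 day

format(t0, "%Y-%m-%d %H:%M:%S")  # "1970-01-01 00:00:00"
format(t1, "%Y-%m-%d %H:%M:%S")  # "1970-01-02 00:00:00"
```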

# Define a function to parse timestamps from seconds since the Unix epoch
dateparse <- function(time_in_secs) {
  as_datetime(as.numeric(time_in_secs), origin = "1970-01-01", tz = "UTC")
}

# Expand the tilde to the home directory and specify the file path
file_path <- "~/Git/Data/coinbaseUSD_1-min_data_2014-12-01_to_2018-11-11.csv"

# Read the CSV file into a data frame named 'data'
data <- read_csv("~/Git/Data/coinbaseUSD_1-min_data_2014-12-01_to_2018-11-11.csv")

# Convert the 'Timestamp' column from numeric Unix timestamps to POSIXct date-time objects
data <- data %>%
  mutate(Timestamp = as.POSIXct(Timestamp, origin = "1970-01-01", tz = "UTC"))

# Display the first few rows of the 'data' data frame
head(data)
## # A tibble: 6 × 8
##   Timestamp            Open  High   Low Close `Volume_(BTC)` `Volume_(Currency)`
##   <dttm>              <dbl> <dbl> <dbl> <dbl>          <dbl>               <dbl>
## 1 2014-12-01 05:33:00   300   300   300   300           0.01                   3
## 2 2014-12-01 05:34:00   NaN   NaN   NaN   NaN         NaN                    NaN
## 3 2014-12-01 05:35:00   NaN   NaN   NaN   NaN         NaN                    NaN
## 4 2014-12-01 05:36:00   NaN   NaN   NaN   NaN         NaN                    NaN
## 5 2014-12-01 05:37:00   NaN   NaN   NaN   NaN         NaN                    NaN
## 6 2014-12-01 05:38:00   NaN   NaN   NaN   NaN         NaN                    NaN
## # ℹ 1 more variable: Weighted_Price <dbl>

Summary statistics

The summary statistics display the minimum and maximum values of each column, which give its range. The 1st quartile, median, mean and 3rd quartile describe the distribution between those extremes; the 1st and 3rd quartiles bound the interquartile range. The summary also reports the total number of missing values (NA's) in each column.

summary(data)
##    Timestamp                           Open               High         
##  Min.   :2014-12-01 05:33:00.00   Min.   :    0.06   Min.   :    0.06  
##  1st Qu.:2015-12-26 19:35:15.00   1st Qu.:  415.60   1st Qu.:  415.68  
##  Median :2016-12-10 21:03:30.00   Median :  909.96   Median :  910.10  
##  Mean   :2016-12-10 18:53:59.23   Mean   : 3208.58   Mean   : 3209.99  
##  3rd Qu.:2017-11-25 22:31:45.00   3rd Qu.: 6374.66   3rd Qu.: 6375.15  
##  Max.   :2018-11-11 00:00:00.00   Max.   :19891.99   Max.   :19891.99  
##                                   NA's   :108957     NA's   :108957    
##       Low               Close           Volume_(BTC)     Volume_(Currency) 
##  Min.   :    0.06   Min.   :    0.06   Min.   :   0.00   Min.   :       0  
##  1st Qu.:  415.50   1st Qu.:  415.59   1st Qu.:   0.88   1st Qu.:     596  
##  Median :  909.57   Median :  909.98   Median :   2.64   Median :    3313  
##  Mean   : 3207.06   Mean   : 3208.58   Mean   :   7.68   Mean   :   35513  
##  3rd Qu.: 6373.31   3rd Qu.: 6374.67   3rd Qu.:   7.48   3rd Qu.:   18552  
##  Max.   :19891.98   Max.   :19891.99   Max.   :1563.27   Max.   :19970765  
##  NA's   :108957     NA's   :108957     NA's   :108957    NA's   :108957    
##  Weighted_Price    
##  Min.   :    0.06  
##  1st Qu.:  415.59  
##  Median :  909.92  
##  Mean   : 3208.52  
##  3rd Qu.: 6374.40  
##  Max.   :19891.99  
##  NA's   :108957
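The NA counts reported by summary() can also be obtained in one line: colSums(is.na(...)) counts missing values per column, and NaN counts as missing too (which matters here, since the gaps were parsed as NaN). A toy frame with the same pattern:

```r
# Per-column missing counts; NaN is treated as missing by is.na()
toy <- data.frame(Open = c(300, NaN, NaN), Close = c(300, NaN, 310))
counts <- colSums(is.na(toy))
counts  # Open = 2, Close = 1

# On the real tibble the same call reproduces the 108957 figure:
# colSums(is.na(data))
```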

Data Cleaning

handling missing values

  • The code below first replaces NAs with zero in the volume and Weighted_Price columns, then forward-fills missing values in the Open, High, Low and Close columns.
# Fill NA values with zeroes for volume/trade-related fields using dplyr's mutate function
data <- data %>%
  mutate(`Volume_(BTC)` = ifelse(is.na(`Volume_(BTC)`), 0, `Volume_(BTC)`),  # Replace NA with 0 in 'Volume_(BTC)'
         `Volume_(Currency)` = ifelse(is.na(`Volume_(Currency)`), 0, `Volume_(Currency)`),  # Replace NA with 0 in 'Volume_(Currency)'
         Weighted_Price = ifelse(is.na(Weighted_Price), 0, Weighted_Price))  # Replace NA with 0 in 'Weighted_Price'

# Fill forward NA values (last observation carried forward) for OHLC data using zoo's na.locf function
data$Open <- zoo::na.locf(data$Open)    # Fill NA values in 'Open' column
data$High <- zoo::na.locf(data$High)    # Fill NA values in 'High' column
data$Low <- zoo::na.locf(data$Low)      # Fill NA values in 'Low' column
data$Close <- zoo::na.locf(data$Close)  # Fill NA values in 'Close' column

# Convert 'Timestamp' column from numeric Unix timestamps to POSIXct date-time objects in UTC timezone
data <- data %>%
  mutate(Timestamp = as.POSIXct(Timestamp, origin = "1970-01-01", tz = "UTC"))

# Display the first few rows of the updated dataframe 'data'
head(data)
## # A tibble: 6 × 8
##   Timestamp            Open  High   Low Close `Volume_(BTC)` `Volume_(Currency)`
##   <dttm>              <dbl> <dbl> <dbl> <dbl>          <dbl>               <dbl>
## 1 2014-12-01 05:33:00   300   300   300   300           0.01                   3
## 2 2014-12-01 05:34:00   300   300   300   300           0                      0
## 3 2014-12-01 05:35:00   300   300   300   300           0                      0
## 4 2014-12-01 05:36:00   300   300   300   300           0                      0
## 5 2014-12-01 05:37:00   300   300   300   300           0                      0
## 6 2014-12-01 05:38:00   300   300   300   300           0                      0
## # ℹ 1 more variable: Weighted_Price <dbl>
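na.locf ("last observation carried forward") simply repeats the most recent non-missing value, which is why the 05:34 to 05:38 rows above all inherit the 300 recorded at 05:33. On a toy vector:

```r
library(zoo)  # for na.locf

x <- c(300, NA, NA, 310, NA)
filled <- na.locf(x)
filled  # 300 300 300 310 310
```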

Getting weekly data

  • The code below restricts the data to the range 2015-01-01 00:00:00 to 2018-11-11 00:00:00, groups it into weekly buckets (each week starting on Monday), and keeps the first row of each week. We then display the weekly data, stored as weekly_rows.
# Create valid date range
start <- ymd_hms("2015-01-01 00:00:00", tz = "UTC")  # Define start date-time in UTC timezone
end <- ymd_hms("2018-11-11 00:00:00", tz = "UTC")    # Define end date-time in UTC timezone

# Find rows between start and end time in 'data' dataframe
filtered_data <- data %>%
  filter(Timestamp >= start & Timestamp <= end)  # Filter data for rows with Timestamp between 'start' and 'end'


# Find the first row of each week (starting on Monday) in filtered_data
weekly_rows <- filtered_data %>%
  mutate(Week = floor_date(Timestamp, unit = "week", week_start = 1)) %>% # Create 'Week' column by flooring Timestamp to the nearest week starting on Monday
  group_by(Week) %>%        # Group rows by 'Week' column
  slice(1) %>%               # Select the first row from each group (week)
  ungroup()                  # Remove grouping for further operations

# Display the first few rows of weekly_rows dataframe
head(weekly_rows)
## # A tibble: 6 × 9
##   Timestamp            Open  High   Low Close `Volume_(BTC)` `Volume_(Currency)`
##   <dttm>              <dbl> <dbl> <dbl> <dbl>          <dbl>               <dbl>
## 1 2015-01-08 01:24:00  360   360   360   360            0.01                 3.6
## 2 2015-01-13 03:38:00  260   260   260   260            1                  260  
## 3 2015-01-19 00:36:00  210   210   210   210            0.2                 42  
## 4 2015-01-26 00:00:00  255.  255.  255.  255.           0                    0  
## 5 2015-02-02 00:00:00  228.  229.  228.  229.           3.81               872. 
## 6 2015-02-09 00:00:00  224   224   224.  224.           0                    0  
## # ℹ 2 more variables: Weighted_Price <dbl>, Week <dttm>
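floor_date(..., unit = "week", week_start = 1) rolls any timestamp back to the preceding Monday at midnight, which is how the first row above (2015-01-08, a Thursday) lands in the week bucket of Monday 2015-01-05:

```r
library(lubridate)

ts <- ymd_hms("2015-01-08 01:24:00", tz = "UTC")     # a Thursday
wk <- floor_date(ts, unit = "week", week_start = 1)  # preceding Monday
format(wk, "%Y-%m-%d")  # "2015-01-05"
```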

Visualizations

Let's visualize Historical Bitcoin Prices (2015-2018)

In the code below we use the plot_ly function from the plotly package to create an interactive plot. The layout adds range-selector buttons for 1 month, 6 months, 1 year, 3 years and the full range, plus a range slider, to make the widget more interactive.

Comparison of the Python plot and the R plot

Python output | R output

The Python output does not distinguish the Open, Close and Weighted_Price series clearly in the plot. Although its interactivity matches that of R, R handles this visualization more versatilely, creating a separate facet for each column while still maintaining the same x and y axes.

The plots themselves show a rising trend in Bitcoin followed by a drop from 2018 onwards.

# Create the traces for each line plot
trace1 <- plot_ly(data = weekly_rows, x = ~Timestamp, y = ~Open, type = 'scatter', mode = 'lines', name = 'Open')
trace2 <- plot_ly(data = weekly_rows, x = ~Timestamp, y = ~Close, type = 'scatter', mode = 'lines', name = 'Close')
trace3 <- plot_ly(data = weekly_rows, x = ~Timestamp, y = ~Weighted_Price, type = 'scatter', mode = 'lines', name = 'Weighted Avg')

# Combine the traces into a subplot with shared axes
fig <- subplot(trace1, trace2, trace3, nrows = 1, shareX = TRUE, shareY = TRUE)

# Update the layout of the plot with title, x-axis options, and range slider
fig1 <- fig %>% layout(
  title = 'Historical Bitcoin Prices (2015-2018) with the Slider',  # Set the title of the plot
  xaxis = list(
    rangeselector = list(
      buttons = list(
        list(count = 1, label = '1m', step = 'month', stepmode = 'backward'),  # Button for 1 month range
        list(count = 6, label = '6m', step = 'month', stepmode = 'backward'),  # Button for 6 months range
        list(count = 12, label = '1y', step = 'month', stepmode = 'backward'),  # Button for 1 year range
        list(count = 36, label = '3y', step = 'month', stepmode = 'backward'),  # Button for 3 years range
        list(step = 'all')  # Button for showing all data
      )
    ),
    rangeslider = list(visible = TRUE),  # Enable the range slider for zooming
    type = 'date'  # Set the type of x-axis to date
  )
)

# Plot the figure
fig1

Let's visualize Historical Bitcoin Volume (USD) (2015-2018)

This code functions exactly like the previous one. The plot shows a spike in traded volume every October to December of each year; this can only be noticed when hovering over the plot, once the interactivity becomes active.

Comparison with python plot

Python output

These two plots are similar throughout, so either language would suit the job for this code.

# Create the trace for Volume (Currency)
trace1 <- plot_ly(data = weekly_rows, x = ~Timestamp, y = ~`Volume_(Currency)`, type = 'scatter', mode = 'lines', name = 'Bitcoin Volume (USD)')

# Update layout with title, x-axis options, and range slider
fig2 <- trace1 %>% layout(
  title = 'Historical Bitcoin Volume (USD) (2015-2018) with the slider',  # Set the title of the plot
  xaxis = list(
    rangeselector = list(
      buttons = list(
        list(count = 1, label = '1m', step = 'month', stepmode = 'backward'),  # Button for 1 month range
        list(count = 6, label = '6m', step = 'month', stepmode = 'backward'),  # Button for 6 months range
        list(count = 12, label = '1y', step = 'month', stepmode = 'backward'),  # Button for 1 year range
        list(count = 36, label = '3y', step = 'month', stepmode = 'backward'),  # Button for 3 years range
        list(step = 'all')  # Button for showing all data
      )
    ),
    rangeslider = list(visible = TRUE),  # Enable the range slider for zooming
    type = 'date'  # Set the type of x-axis to date
  )
)

# Plot the figure
fig2

BTC Volume

This is a BTC volume plot where we see the relationship between BTC volume and weighted price. There is a cluster of volumes at specific prices.

History of Bitcoin Prices

  • 18 August 2008, the domain name bitcoin.org was registered.
  • November 6th 2010, the Bitcoin market capitalization reaches 1 million USD. Its exchange rate on MtGox reaches USD 0.50 per BTC.
  • June 2nd 2011, the USD to BTC rate reaches 10 USD to the coin. For 6 days, the Bitcoin value holds at 31.91 USD on MtGox.
  • February 28th 2013, the Bitcoin exchange rate surpasses 31.91 USD for the first time in 601 days.
  • April 1st 2013, the exchange rate of Bitcoin reaches 100 USD to 1 BTC.
  • January 2015, Coinbase raises 75 million USD as part of a Series C funding round, smashing the previous record for a Bitcoin company.
  • February 2015, the Bitcoin price reaches USD 262.
  • January 2017, after rallying for most of the second half of 2016, Bitcoin breaks the USD 1,000 mark for the first time in 3 years.
  • June 12th 2017, the Bitcoin exchange rate exceeds USD 3,000 to the BTC.
  • November 29th 2017, the Bitcoin price exceeds USD 10,000.
  • December 18th 2017, Bitcoin reaches a record high, but does not reach USD 20,000.
  • December 28th 2017, the price of Bitcoin falls amid volatile trading after South Korea, the world’s third largest cryptocurrency market, announces additional measures to regulate Bitcoin trading, including the potential closure of exchanges.
  • October 31st 2018, on the 10 year anniversary of Bitcoin, the price holds steady above USD 6,000 (around USD 6,300) during a period of historically low volatility.

Comparison with Python plot

Volume plot from python
volume plot in R

These two plots are similar, making either language a good fit for this kind of plot.

library(plotly)     # Load the plotly package for interactive plots
library(dplyr)      # Load the dplyr package for data manipulation
library(lubridate)  # Load the lubridate package for handling date/time data

# Create the trace for BTC Volume vs USD
trace <- plot_ly(data = weekly_rows, x = ~Weighted_Price, y = ~`Volume_(BTC)`, 
                 type = 'scattergl', mode = 'markers', marker = list(color = '#FFBAD2', line = list(width = 1)))

# Update layout with title and axis labels
fig3 <- trace %>% layout(
  title = 'BTC Volume vs USD',  # Set the title of the plot
  xaxis = list(title = 'Weighted Price', titlefont = list(family = 'Courier New, monospace', size = 18, color = '#7f7f7f')),  # Set x-axis title and formatting
  yaxis = list(title = 'Volume BTC', titlefont = list(family = 'Courier New, monospace', size = 18, color = '#7f7f7f'))  # Set y-axis title and formatting
)

# Plot the figure
fig3

Time Series Forecasting

A time series is a set of data points observed at successive points in time (typically equally spaced, as once a day, once an hour, or once a minute). Daily sales data for airline tickets, for instance, form a time series. Dates of significant airline tragedies, by contrast, are not a time series, since they are arbitrarily spaced; they are just a collection of occurrences with a time component, a realization of a point process (a class of random processes).

Time series exhibit several important characteristics, among them trend, seasonality, and noise. Forecasting is the technique of predicting the future using data from the past and present. In this kernel we attempt to perform time series analysis on the historic Bitcoin price data. As the Data Exploration section shows, Bitcoin prices were quite volatile and inconsistent over the years, and it is very hard to perform time series analysis on such volatile data. Nevertheless, we explore several different time series forecasting models. All the models used in this kernel are very basic; there is scope for more complex and better performing models.
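Trend, seasonality and noise can be pulled apart explicitly with a classical decomposition. As a sketch (using the built-in monthly AirPassengers series rather than the Bitcoin data, which at minute resolution has no clean fixed seasonality):

```r
# Classical additive decomposition: x = trend + seasonal + remainder
parts <- decompose(AirPassengers)

# By construction the three components add back to the observed series
recon <- parts$trend + parts$seasonal + parts$random
max(abs(na.omit(AirPassengers - recon)))  # effectively 0
```

plot(parts) draws the four panels (observed, trend, seasonal, remainder) in one figure.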

Prediction using LSTM

The Long Short-Term Memory (LSTM) model is a type of recurrent neural network (RNN) designed to capture long-term dependencies and manage the vanishing gradient problem that traditional RNNs often face. Here’s the mathematical formulation of the LSTM model:

LSTM Components

  1. Input Gate:
    • Input: \(x_t\) (input at time step \(t\))
    • Previous hidden state: \(h_{t-1}\)
    • Previous cell state: \(c_{t-1}\)
    • Activation function: Sigmoid \(\sigma\) and hyperbolic tangent \(\tanh\)
    • Update: \[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \]
  2. Forget Gate:
    • Input: \(x_t\), \(h_{t-1}\)
    • Activation function: Sigmoid \(\sigma\)
    • Update: \[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \]
  3. Cell State:
    • Input gate output: \(i_t\)
    • Forget gate output: \(f_t\)
    • Candidate cell state: \(\tilde{c}_t\) (proposed cell state update)
    • Update: \[ \tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \] \[ c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t \]
  4. Output Gate:
    • Input: \(x_t\), \(h_{t-1}\)
    • Activation function: Sigmoid \(\sigma\) and hyperbolic tangent \(\tanh\)
    • Update: \[ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \] \[ h_t = o_t \cdot \tanh(c_t) \]

Parameters

  • \(W_i, W_f, W_c, W_o\): Weight matrices for input, forget, cell, and output gates, respectively.
  • \(b_i, b_f, b_c, b_o\): Bias vectors for input, forget, cell, and output gates, respectively.
  • \([h_{t-1}, x_t]\): Concatenation of previous hidden state \(h_{t-1}\) and current input \(x_t\).

Function Explanation

  • Sigmoid Activation (σ): Scales input values between 0 and 1, controlling the flow of information through gates.
  • Hyperbolic Tangent (tanh): Squashes input values between -1 and 1, regulating the cell state update and output.
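The four gate equations can be checked numerically for a scalar cell. With every gate sharing the same hypothetical weights, chosen only to make the arithmetic visible, one forward step is:

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# One scalar LSTM step; W acts on the concatenated [h_{t-1}, x_t]
W <- c(0.5, 0.5); b <- 0                  # shared toy weights for all gates
h_prev <- 0.1; c_prev <- 0.2; x_t <- 1.0
z <- sum(W * c(h_prev, x_t)) + b          # W . [h_{t-1}, x_t] + b = 0.55

i_t <- sigmoid(z)                         # input gate
f_t <- sigmoid(z)                         # forget gate
c_tilde <- tanh(z)                        # candidate cell state
c_t <- f_t * c_prev + i_t * c_tilde       # new cell state, ~0.444
o_t <- sigmoid(z)                         # output gate
h_t <- o_t * tanh(c_t)                    # new hidden state, ~0.265
```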

Data importation

This segment is the same as the previous import code, except that it also resamples the minute data to hourly resolution and forward-fills the Weighted_Price column.

data <- read_csv("~/Git/Data/bitstampUSD_1-min_data_2012-01-01_to_2018-11-11.csv")  # Read CSV data into 'data' dataframe
data <- data %>% 
  mutate(Timestamp = as.POSIXct(Timestamp, origin = "1970-01-01", tz = "UTC"))  # Convert 'Timestamp' to POSIXct with UTC timezone

# Define a function to convert Unix timestamp to datetime with UTC timezone
dateparse <- function(time_in_secs) {
  as.POSIXct(time_in_secs, origin = "1970-01-01", tz = "UTC")
}

# Parse the 'Timestamp' column using 'dateparse' function
data <- data %>% 
  mutate(Timestamp = dateparse(Timestamp))

# Group by hour and take the first value in each group
data <- data %>% 
  mutate(Timestamp = floor_date(Timestamp, unit = "hour")) %>%  # Floor 'Timestamp' to the nearest hour
  group_by(Timestamp) %>%  # Group by 'Timestamp'
  slice(1) %>%  # Take the first row from each group
  ungroup() %>%  # Ungroup the data
  select(Timestamp, Weighted_Price)  # Select 'Timestamp' and 'Weighted_Price' columns

# Fill missing values in 'Weighted_Price' with the last available value
data <- data %>% 
  fill(Weighted_Price)  # Fill NA values in 'Weighted_Price' with the last non-NA value

# Print the first few rows of the dataframe
head(data)  # Display the first few rows of 'data'
## # A tibble: 6 × 2
##   Timestamp           Weighted_Price
##   <dttm>                       <dbl>
## 1 2011-12-31 07:00:00           4.39
## 2 2011-12-31 08:00:00           4.39
## 3 2011-12-31 09:00:00           4.39
## 4 2011-12-31 10:00:00           4.39
## 5 2011-12-31 11:00:00           4.39
## 6 2011-12-31 12:00:00           4.39

For the train and test sets, we take ‘25-Jun-2018’ as the split date. There was a considerable dip in Bitcoin prices in the June-July 2018 period. If we check the historical prices, the market started going up again from this date after bottoming out, though the price reached a lower $5,972 on June 29th 2018. After reaching the historic 20K mark on December 18th 2017, there were several dips, and the market price was corrected every time. Read more about Bitcoin dips.

# Convert 'Timestamp' column to POSIXct with UTC timezone
data$Timestamp <- as.POSIXct(data$Timestamp, origin = "1970-01-01", tz = "UTC")

# Define the split date
split_date <- as.POSIXct("2018-06-25", tz = "UTC")

# Split data into training and testing sets
data_train <- data %>% 
  filter(Timestamp <= split_date) %>%  # Filter rows where Timestamp is less than or equal to split_date
  select(Timestamp, Weighted_Price) %>%  # Select Timestamp and Weighted_Price columns
  mutate(Timestamp = as.character(Timestamp))  # Convert Timestamp to character for writing to CSV

  data_test <- data %>% 
  filter(Timestamp > split_date) %>%  # Filter rows where Timestamp is greater than split_date
  select(Timestamp, Weighted_Price) %>%  # Select Timestamp and Weighted_Price columns
  mutate(Timestamp = as.character(Timestamp))  # Convert Timestamp to character for writing to CSV

# Print head of training and testing sets
head(data_train)  # Display the first few rows of 'data_train'
## # A tibble: 6 × 2
##   Timestamp           Weighted_Price
##   <chr>                        <dbl>
## 1 2011-12-31 07:00:00           4.39
## 2 2011-12-31 08:00:00           4.39
## 3 2011-12-31 09:00:00           4.39
## 4 2011-12-31 10:00:00           4.39
## 5 2011-12-31 11:00:00           4.39
## 6 2011-12-31 12:00:00           4.39
head(data_test)   # Display the first few rows of 'data_test'
## # A tibble: 6 × 2
##   Timestamp           Weighted_Price
##   <chr>                        <dbl>
## 1 2018-06-25 01:00:00          6170.
## 2 2018-06-25 02:00:00          6148.
## 3 2018-06-25 03:00:00          6131.
## 4 2018-06-25 04:00:00          6139.
## 5 2018-06-25 05:00:00          6142.
## 6 2018-06-25 06:00:00          6155.
# Define the color palette
color_pal <- c("#F8766D", "#D39200", "#93AA00", "#00BA38", "#00C19F", "#00B9E3", "#619CFF", "#DB72FB")

# Plot using ggplot2
ggplot(data, aes(x = Timestamp, y = Weighted_Price)) +
  geom_line(color = color_pal[1]) +
  labs(title = "BTC Weighted_Price Price (USD) by Hours") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1)) +  # Adjust x-axis text angle and alignment
  scale_x_datetime(date_labels = "%b %Y", date_breaks = "1 month")  # Customize x-axis date labels and breaks

# Save the plot with adjusted size and layout
ggsave("btc_price_plot.png", width = 12, height = 6, dpi = 300)  # Adjust width and height as needed
# Convert 'Timestamp' to POSIXct for proper datetime handling
data_test$Timestamp <- as.POSIXct(data_test$Timestamp, origin = "1970-01-01", tz = "UTC")
data_train$Timestamp <- as.POSIXct(data_train$Timestamp, origin = "1970-01-01", tz = "UTC")

# Rename columns for clarity
data_test <- data_test %>% 
  rename(`Test Set` = Weighted_Price)  # Rename 'Weighted_Price' column in data_test to 'Test Set'

data_train <- data_train %>% 
  rename(`Training Set` = Weighted_Price)  # Rename 'Weighted_Price' column in data_train to 'Training Set'

# Join data_test and data_train
joined_data <- full_join(data_test, data_train, by = "Timestamp")  # Perform a full join on 'Timestamp'

# Plot using ggplot2
ggplot(joined_data, aes(x = Timestamp)) +  # Define x-axis as 'Timestamp'
  geom_line(aes(y = `Test Set`), color = "#F8766D") +  # Add line plot for 'Test Set' with specified color
  geom_line(aes(y = `Training Set`), color = "#D39200") +  # Add line plot for 'Training Set' with specified color
  labs(title = "BTC Weighted_Price Price (USD) by years") +  # Set plot title
  theme_minimal() +  # Apply minimal theme
  theme(plot.title = element_text(hjust = 0.5)) +  # Center the plot title
  xlab("Date") +  # Label x-axis
  ylab("Weighted Price (USD)")  # Label y-axis

We will use a vanilla LSTM here for forecasting, implemented from scratch below and trained on pre 25-Jun-2018 data. Note that only the forward pass is implemented; backpropagation is left as a stub in the training loop, which is why the reported loss stays constant across epochs.

library(ggplot2)

# Data Preprocessing
data <- na.omit(data)
data$hour <- as.POSIXct(format(data$Timestamp, "%Y-%m-%d %H:00:00"), tz = "UTC")
data <- data[!duplicated(data$hour), ]

fill_missing <- function(values) {
  last_value <- NA
  filled_values <- numeric(length(values))
  for (i in seq_along(values)) {
    if (is.na(values[i])) {
      filled_values[i] <- last_value
    } else {
      filled_values[i] <- values[i]
      last_value <- values[i]
    }
  }
  return(filled_values)
}

data$Weighted_Price <- fill_missing(data$Weighted_Price)

split_date <- as.POSIXct("2018-06-25", tz = "UTC")
data_train <- data[data$hour <= split_date, "Weighted_Price", drop = FALSE]
data_test <- data[data$hour > split_date, "Weighted_Price", drop = FALSE]

scale_data <- function(values) {
  mean_val <- mean(values, na.rm = TRUE)
  std_val <- sd(values, na.rm = TRUE)
  scaled_values <- (values - mean_val) / std_val
  list(scaled_values = scaled_values, mean = mean_val, std = std_val)
}

scaled_train <- scale_data(data_train$Weighted_Price)
training_set <- scaled_train$scaled_values
# Scale the test data with the TRAINING mean/sd so both sets share one scale
testing_set <- (data_test$Weighted_Price - scaled_train$mean) / scaled_train$std

X_train <- training_set[-length(training_set)]
y_train <- training_set[-1]
y_train <- matrix(y_train, ncol = 1)

X_train <- array(X_train, dim = c(length(X_train), 1, 1))

# LSTM Model Definition
sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

tanh_act <- function(x) {
  tanh(x)
}

initialize_weights <- function(input_size, output_size) {
  matrix(rnorm(input_size * output_size), nrow = input_size, ncol = output_size)
}

SimpleLSTM <- function(input_shape, units) {
  list(
    units = units,
    Wf = initialize_weights(input_shape[2], units),
    Wi = initialize_weights(input_shape[2], units),
    Wc = initialize_weights(input_shape[2], units),
    Wo = initialize_weights(input_shape[2], units),
    bf = rep(0, units),
    bi = rep(0, units),
    bc = rep(0, units),
    bo = rep(0, units),
    Wy = initialize_weights(units, 1),
    by = 0
  )
}

forward <- function(model, x) {
  h_prev <- rep(0, model$units)
  c_prev <- rep(0, model$units)
  for (t in 1:dim(x)[1]) {
    xt <- x[t, , , drop = FALSE]
    f_t <- sigmoid(xt %*% model$Wf + model$bf)
    i_t <- sigmoid(xt %*% model$Wi + model$bi)
    c_tilde_t <- tanh_act(xt %*% model$Wc + model$bc)
    c_t <- f_t * c_prev + i_t * c_tilde_t
    o_t <- sigmoid(xt %*% model$Wo + model$bo)
    h_t <- o_t * tanh_act(c_t)
    h_prev <- h_t
    c_prev <- c_t
  }
  y <- h_t %*% model$Wy + model$by
  y
}

train <- function(model, X, y, epochs, lr) {
  for (epoch in 1:epochs) {
    loss <- 0
    for (i in 1:dim(X)[1]) {
      output <- forward(model, X[i, , , drop = FALSE])
      error <- output - y[i, ]
      loss <- loss + sum(error^2)
      # Backpropagation would be added here
    }
    if (epoch %% 10 == 0) {
      cat("Epoch", epoch, "Loss:", loss, "\n")
    }
  }
}

# Prediction Function
predict_lstm <- function(model, X) {
  predictions <- numeric(dim(X)[1])
  for (i in 1:dim(X)[1]) {
    predictions[i] <- forward(model, X[i, , , drop = FALSE])
  }
  return(predictions)
}

input_shape <- c(1, 1)
units <- 128
model <- SimpleLSTM(input_shape, units)
train(model, X_train, y_train, epochs = 100, lr = 0.001)
## Epoch 10 Loss: 278978.5 
## Epoch 20 Loss: 278978.5 
## Epoch 30 Loss: 278978.5 
## Epoch 40 Loss: 278978.5 
## Epoch 50 Loss: 278978.5 
## Epoch 60 Loss: 278978.5 
## Epoch 70 Loss: 278978.5 
## Epoch 80 Loss: 278978.5 
## Epoch 90 Loss: 278978.5 
## Epoch 100 Loss: 278978.5
# Prepare test data for prediction
X_test <- array(testing_set[-length(testing_set)], dim = c(length(testing_set) - 1, 1, 1))

# Make predictions
predictions <- predict_lstm(model, X_test)

# Rescale predictions to original scale
predictions_rescaled <- predictions * scaled_train$std + scaled_train$mean
actual_rescaled <- testing_set[-1] * scaled_train$std + scaled_train$mean

# Prepare data for plotting
plot_data <- data.frame(
  #Timestamp = data[data$hour > split_date, "hour"][-1],
  Actual = actual_rescaled,
  Predicted = predictions_rescaled
)


# Plotting
ggplot(plot_data, aes(x = data$Timestamp[1:length(plot_data$Actual)] )) +
  geom_line(aes(y = Actual, color = "Actual")) +
  geom_line(aes(y = Predicted, color = "Predicted")) +
  labs(title = "LSTM Predictions vs Actual",
       x = "Time",
       y = "Weighted Price") +
  scale_color_manual(values = c("Actual" = "blue", "Predicted" = "red")) +
  theme_minimal()
## Warning: Use of `plot_data$Actual` is discouraged.
## ℹ Use `Actual` instead.
## Use of `plot_data$Actual` is discouraged.
## ℹ Use `Actual` instead.

summary(model)
##       Length Class  Mode   
## units   1    -none- numeric
## Wf    128    -none- numeric
## Wi    128    -none- numeric
## Wc    128    -none- numeric
## Wo    128    -none- numeric
## bf    128    -none- numeric
## bi    128    -none- numeric
## bc    128    -none- numeric
## bo    128    -none- numeric
## Wy    128    -none- numeric
## by      1    -none- numeric

Comparison with Python

XGBOOST

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. XGBoost is a powerful and versatile tool. Lets see, How well does XGBoost perform when used to predict future values of a time-series like Bitcoin prices ?
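Because gradient boosted trees have no built-in notion of time, the series first has to be reframed as a supervised-learning table of lagged values mapped to the next value. A minimal base-R sketch of that reshaping (make_lags is a hypothetical helper, not part of the original analysis; the resulting matrix is what would be handed to the xgboost package):

```r
# Turn a series y into rows of (lag1 .. lagK -> next value)
make_lags <- function(y, n_lags) {
  n <- length(y) - n_lags
  X <- sapply(seq_len(n_lags), function(k) y[(n_lags - k + 1):(n_lags - k + n)])
  colnames(X) <- paste0("lag", seq_len(n_lags))
  list(X = X, target = y[(n_lags + 1):length(y)])
}

sup <- make_lags(c(10, 11, 13, 16, 20, 25), n_lags = 2)
sup$target  # 13 16 20 25
sup$X[1, ]  # lag1 = 11 (previous value), lag2 = 10 (one before that)
```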

cat("\014")  # Clear the console
# Clear the environment (remove all objects)
rm(list = ls())

Loading packages afresh

# Define the list of package names
packages <- c("tidyverse", "plotly", "lubridate", "readr", "dplyr", "zoo", "ggplot2", "forecast", "stats", "tseries")
# Use lapply to load each package
lapply(packages, library, character.only = TRUE)
# Define a function to parse timestamps from seconds since the Unix epoch
dateparse <- function(time_in_secs) {
  as.POSIXct(as.numeric(time_in_secs), origin="1970-01-01", tz="UTC")  # Convert seconds to POSIXct date-time format
}

# Replace with your actual file path to the CSV file
file_path <- '~/Git/Data/coinbaseUSD_1-min_data_2014-12-01_to_2018-11-11.csv'

# Read the CSV file into a data frame
data <- read.csv(file_path)

# Convert the 'Timestamp' column from Unix time to POSIXct date-time format
data <- data %>%
  mutate(Timestamp = as.POSIXct(Timestamp, origin = "1970-01-01", tz = "UTC"))

Missing values and cleaning

# Fill NA values with zeroes for volume/trade-related fields using dplyr's mutate function
data <- data %>%
  mutate(Volume_.BTC. = ifelse(is.na(Volume_.BTC.), 0, Volume_.BTC.),  # Replace NA with 0 in 'Volume_(BTC)'
         Volume_.Currency. = ifelse(is.na(Volume_.Currency.), 0, Volume_.Currency.),  # Replace NA with 0 in 'Volume_(Currency)'
         Weighted_Price = ifelse(is.na(Weighted_Price), 0, Weighted_Price))  # Replace NA with 0 in 'Weighted_Price'

# Fill forward NA values (last observation carried forward) for OHLC data using zoo's na.locf function
data$Open <- zoo::na.locf(data$Open)    # Fill NA values in 'Open' column
data$High <- zoo::na.locf(data$High)    # Fill NA values in 'High' column
data$Low <- zoo::na.locf(data$Low)      # Fill NA values in 'Low' column
data$Close <- zoo::na.locf(data$Close)  # Fill NA values in 'Close' column

# Convert 'Timestamp' column from numeric Unix timestamps to POSIXct date-time objects in UTC timezone
data <- data %>%
  mutate(Timestamp = as.POSIXct(Timestamp, origin = "1970-01-01", tz = "UTC"))

# Display the first few rows of the updated dataframe 'data'
head(data)
##             Timestamp Open High Low Close Volume_.BTC. Volume_.Currency.
## 1 2014-12-01 05:33:00  300  300 300   300         0.01                 3
## 2 2014-12-01 05:34:00  300  300 300   300         0.00                 0
## 3 2014-12-01 05:35:00  300  300 300   300         0.00                 0
## 4 2014-12-01 05:36:00  300  300 300   300         0.00                 0
## 5 2014-12-01 05:37:00  300  300 300   300         0.00                 0
## 6 2014-12-01 05:38:00  300  300 300   300         0.00                 0
##   Weighted_Price
## 1            300
## 2              0
## 3              0
## 4              0
## 5              0
## 6              0

# Group by hourly frequency, take the first observation per hour
data <- data %>% 
  mutate(Timestamp = floor_date(Timestamp, unit = "hour")) %>%  # Floor 'Timestamp' to the nearest hour
  group_by(Timestamp) %>%  # Group by 'Timestamp'
  slice(1) %>%  # Take the first row from each group
  ungroup() %>%  # Ungroup the data
  select(Timestamp, Weighted_Price)  # Select 'Timestamp' and 'Weighted_Price' columns


# Display the first few rows of the updated dataframe 'data'
head(data)
## # A tibble: 6 × 2
##   Timestamp           Weighted_Price
##   <dttm>                       <dbl>
## 1 2014-12-01 05:00:00            300
## 2 2014-12-01 06:00:00              0
## 3 2014-12-01 07:00:00              0
## 4 2014-12-01 08:00:00              0
## 5 2014-12-01 09:00:00              0
## 6 2014-12-01 10:00:00              0
# Load necessary libraries
library(dplyr)
library(zoo)

# Assuming 'data' is already loaded with 'Timestamp' and 'Weighted_Price' columns

# Convert to the system's local timezone (an empty tz string uses the system
# default; note this re-expresses the times rather than stripping the timezone)
data$Timestamp <- as.POSIXct(data$Timestamp, tz = "")

# Group by hourly frequency, take the first observation per hour
data <- data %>%
  mutate(Timestamp = floor_date(Timestamp, unit = "hour")) %>%  # Floor 'Timestamp' to the nearest hour
  group_by(Timestamp) %>%  # Group by 'Timestamp'
  slice(1) %>%  # Take the first row from each group
  ungroup()  # Ungroup the data

# Fill missing values using forward fill (na.locf from zoo package)
data$Weighted_Price <- na.locf(data$Weighted_Price)

# Print or return processed data
head(data)
## # A tibble: 6 × 2
##   Timestamp           Weighted_Price
##   <dttm>                       <dbl>
## 1 2014-12-01 08:00:00            300
## 2 2014-12-01 09:00:00              0
## 3 2014-12-01 10:00:00              0
## 4 2014-12-01 11:00:00              0
## 5 2014-12-01 12:00:00              0
## 6 2014-12-01 13:00:00              0
# Load necessary libraries
library(ggplot2)

# Define color palette
color_pal <- c("#F8766D", "#D39200", "#93AA00", "#00BA38", "#00C19F", "#00B9E3", "#619CFF", "#DB72FB")

# Plot using ggplot2
ggplot(data, aes(x = Timestamp, y = Weighted_Price)) +
  geom_line(color = color_pal[1]) +  # Adjust color based on your preference
  labs(title = "BTC Weighted Price (USD) by Hour") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # Rotate x-axis labels if needed
  scale_x_datetime(date_labels = "%Y-%m-%d %H:%M") +  # Adjust date format on x-axis
  labs(x = "Timestamp", y = "Weighted_Price (USD)") +
  NULL  # No-op terminator for the ggplot chain; makes trailing layers easy to comment out

Train and test data

# Load necessary libraries
library(dplyr)  # For data manipulation
library(ggplot2)  # For data visualization

# Convert 'Timestamp' to Date format (if it's POSIXct)
data$Timestamp <- as.Date(data$Timestamp)  # Ensure the Timestamp column is in Date format

# Define split date for training and testing sets
split_date <- as.Date("2018-06-25")  # The date to split the data into training and testing sets

# Split data into training set (before or on split_date) and select relevant columns
data_train <- data %>% 
  filter(Timestamp <= split_date) %>%  # Filter rows where Timestamp is on or before the split date
  select(Timestamp, Weighted_Price)  # Select only the Timestamp and Weighted_Price columns

# Split data into testing set (after split_date) and select relevant columns
data_test <- data %>% 
  filter(Timestamp > split_date) %>%  # Filter rows where Timestamp is after the split date
  select(Timestamp, Weighted_Price)  # Select only the Timestamp and Weighted_Price columns

# Plot the data using ggplot2
ggplot() +
  geom_line(data = data_test, aes(x = Timestamp, y = Weighted_Price, color = "Test Set")) +  # Plot the testing set with a line
  geom_line(data = data_train, aes(x = Timestamp, y = Weighted_Price, color = "Training Set")) +  # Plot the training set with a line
  labs(title = "BTC Weighted Price (USD) by Hour") +  # Add a title to the plot
  theme_minimal() +  # Use a minimal theme for the plot
  scale_x_date(date_labels = "%Y-%m-%d", date_breaks = "3 months") +  # Dates carry no time component, so use a date-only format; daily breaks over several years would be unreadable
  labs(x = "Timestamp", y = "Weighted_Price (USD)") +  # Label the x and y axes
  scale_color_manual(values = c("Test Set" = "#F8766D", "Training Set" = "#00B9E3"))  # Set custom colors for the lines representing the test and training sets

library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:plotly':
## 
##     slice
## The following object is masked from 'package:dplyr':
## 
##     slice
# Convert 'Timestamp' column to numeric format (seconds since the Unix epoch) if necessary
data_train$Timestamp <- as.numeric(as.POSIXct(data_train$Timestamp))

# Convert the data frame to matrix format for XGBoost
# NOTE: including 'Weighted_Price' among the features leaks the target into the
# model; in a real forecasting setup, lagged prices or calendar features should
# be used instead. It is kept here to match the printed results below.
X_train <- as.matrix(data_train[, c("Timestamp", "Weighted_Price")])  # Select 'Timestamp' and 'Weighted_Price' columns and convert to matrix
y_train <- data_train$Weighted_Price  # Extract the target variable 'Weighted_Price'

# Create DMatrix, a special data structure in XGBoost for training
dtrain <- xgb.DMatrix(data = X_train, label = y_train)  # Create a DMatrix with data and labels
dtest <- xgb.DMatrix(data = X_train, label = y_train)   # (Reuses the training data; no separate test set at this stage)
# Define parameters for the XGBoost model
params <- list(
  objective = "reg:squarederror",  # Specify the objective function type for regression
  min_child_weight = 10,  # Minimum sum of instance weight (hessian) needed in a child
  booster = "gbtree",  # Use tree-based boosters
  colsample_bytree = 0.3,  # Fraction of features to be used for each tree
  learning_rate = 0.1,  # Step size shrinkage to prevent overfitting
  max_depth = 5,  # Maximum depth of a tree
  alpha = 10  # L1 regularization term on weights (lasso regression)
)

# Train the XGBoost model
model.boost <- xgb.train(
  params = params,  # Model parameters
  data = dtrain,  # Training data
  nrounds = 100,  # Number of boosting rounds
  early_stopping_rounds = 50,  # Stop training if the performance doesn't improve for 50 rounds
  maximize = FALSE,  # Minimize the evaluation metric (since this is regression)
  watchlist = list(train = dtrain),  # List of data to evaluate on during training
  verbose = 0  # Set to 1 to print training progress, 0 for silent mode
)

summary(model.boost)
##                 Length Class              Mode       
## handle               1 xgb.Booster.handle externalptr
## raw             237062 -none-             raw        
## best_iteration       1 -none-             numeric    
## best_ntreelimit      1 -none-             numeric    
## best_score           1 -none-             numeric    
## best_msg             1 -none-             character  
## niter                1 -none-             numeric    
## evaluation_log       2 data.table         list       
## call                 8 -none-             call       
## params               8 -none-             list       
## callbacks            2 -none-             list       
## feature_names        2 -none-             character  
## nfeatures            1 -none-             numeric
#model.boost$evaluation_log
model.boost$params
## $objective
## [1] "reg:squarederror"
## 
## $min_child_weight
## [1] 10
## 
## $booster
## [1] "gbtree"
## 
## $colsample_bytree
## [1] 0.3
## 
## $learning_rate
## [1] 0.1
## 
## $max_depth
## [1] 5
## 
## $alpha
## [1] 10
## 
## $validate_parameters
## [1] TRUE
model.boost$best_score
## [1] 216.5385

MAE and MSE

# Make predictions on training data
train_pred <- predict(model.boost, dtrain)

# Get actual values from training data
train_actual <- as.numeric(getinfo(dtrain, "label"))  # Extract the actual labels from dtrain

# Calculate Mean Squared Error (MSE)
train_mse <- mean((train_actual - train_pred)^2)

# Calculate Mean Absolute Error (MAE)
train_mae <- mean(abs(train_actual - train_pred))

# Print the results
cat("Training MSE:", train_mse, "\n")
## Training MSE: 46888.94
cat("Training MAE:", train_mae, "\n")
## Training MAE: 54.29731

Prediction

# Load necessary libraries
library(xgboost)  # For XGBoost model
library(ggplot2)  # For plotting

# Convert 'Timestamp' column to numeric format (seconds since the Unix epoch) if necessary
data_train$Timestamp <- as.numeric(as.POSIXct(data_train$Timestamp))

# Convert the data frame to matrix format for XGBoost
X_train <- as.matrix(data_train[, c("Timestamp", "Weighted_Price")])  # Select 'Timestamp' and 'Weighted_Price' columns and convert to matrix
y_train <- data_train$Weighted_Price  # Extract the target variable 'Weighted_Price'

# Create DMatrix, a special data structure in XGBoost for training
dtrain <- xgb.DMatrix(data = X_train, label = y_train)  # Create a DMatrix with data and labels

# Define parameters for the XGBoost model
params <- list(
  objective = "reg:squarederror",  # Specify the objective function type for regression
  min_child_weight = 10,  # Minimum sum of instance weight (hessian) needed in a child
  booster = "gbtree",  # Use tree-based boosters
  colsample_bytree = 0.3,  # Fraction of features to be used for each tree
  learning_rate = 0.1,  # Step size shrinkage to prevent overfitting
  max_depth = 5,  # Maximum depth of a tree
  alpha = 10  # L1 regularization term on weights (lasso regression)
)

# Train the XGBoost model
model <- xgb.train(
  params = params,  # Model parameters
  data = dtrain,  # Training data
  nrounds = 100,  # Number of boosting rounds
  early_stopping_rounds = 50,  # Stop training if the performance doesn't improve for 50 rounds
  maximize = FALSE,  # Minimize the evaluation metric (since this is regression)
  watchlist = list(train = dtrain),  # List of data to evaluate on during training
  verbose = 0  # Set to 1 to print training progress, 0 for silent mode
)

# Make predictions on the training data
predictions <- predict(model, dtrain)

# Convert 'Timestamp' back to POSIXct format for plotting
data_train$Timestamp <- as.POSIXct(data_train$Timestamp, origin="1970-01-01", tz="UTC")

# Combine actual and predicted values into a single data frame
results <- data.frame(
  Timestamp = data_train$Timestamp,
  Actual = y_train,
  Predicted = predictions
)

# Plot actual vs predicted values using ggplot2
ggplot(results, aes(x = Timestamp)) +
  geom_line(aes(y = Actual, color = "Actual")) +  # Plot actual values
  geom_line(aes(y = Predicted, color = "Predicted")) +  # Plot predicted values
  labs(title = "Actual vs Predicted Weighted Price",
       x = "Timestamp",
       y = "Weighted Price (USD)") +
  scale_color_manual(values = c("Actual" = "blue", "Predicted" = "red")) +  # Define colors for the lines
  theme_minimal()  # Apply a minimal theme

Forecasting ahead

FORECAST COMPARISON WITH PYTHON

python forecast
# Load necessary libraries
library(xgboost)  # For XGBoost model
library(ggplot2)  # For plotting
library(dplyr)    # For data manipulation

# Convert 'Timestamp' column to numeric format (seconds since the Unix epoch) if necessary
data$Timestamp <- as.numeric(as.POSIXct(data$Timestamp))

# Split data into training and testing sets
split_date <- as.numeric(as.POSIXct("2018-06-25"))
data_train <- data %>% filter(Timestamp <= split_date)
data_test <- data %>% filter(Timestamp > split_date)

# Convert to matrix format for XGBoost
X_train <- as.matrix(data_train[, c("Timestamp", "Weighted_Price")])
y_train <- data_train$Weighted_Price

X_test <- as.matrix(data_test[, c("Timestamp", "Weighted_Price")])
y_test <- data_test$Weighted_Price

# Create DMatrix for training and testing
dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest <- xgb.DMatrix(data = X_test, label = y_test)

# Define parameters for the XGBoost model
params <- list(
  objective = "reg:squarederror",  # Specify the objective function type for regression
  min_child_weight = 10,  # Minimum sum of instance weight (hessian) needed in a child
  booster = "gbtree",  # Use tree-based boosters
  colsample_bytree = 0.3,  # Fraction of features to be used for each tree
  learning_rate = 0.1,  # Step size shrinkage to prevent overfitting
  max_depth = 5,  # Maximum depth of a tree
  alpha = 10  # L1 regularization term on weights (lasso regression)
)

# Train the XGBoost model
model <- xgb.train(
  params = params,  # Model parameters
  data = dtrain,  # Training data
  nrounds = 100,  # Number of boosting rounds
  early_stopping_rounds = 50,  # Stop training if the performance doesn't improve for 50 rounds
  maximize = FALSE,  # Minimize the evaluation metric (since this is regression)
  watchlist = list(train = dtrain),  # List of data to evaluate on during training
  verbose = 0  # Set to 1 to print training progress, 0 for silent mode
)

# Make predictions on the training data
predictions_train <- predict(model, dtrain)

# Make predictions on the test data
predictions_test <- predict(model, dtest)

# Convert 'Timestamp' back to POSIXct format for plotting
data_train$Timestamp <- as.POSIXct(data_train$Timestamp, origin="1970-01-01", tz="UTC")
data_test$Timestamp <- as.POSIXct(data_test$Timestamp, origin="1970-01-01", tz="UTC")

# Combine actual and predicted values into a single data frame
results_train <- data.frame(
  Timestamp = data_train$Timestamp,
  Actual = y_train,
  Predicted = predictions_train
)

results_test <- data.frame(
  Timestamp = data_test$Timestamp,
  Actual = y_test,
  Predicted = predictions_test
)

# Plot actual vs predicted values using ggplot2
ggplot() +
  geom_line(data = results_train, aes(x = Timestamp, y = Actual, color = "Actual Train")) +  # Plot actual values for training data
  geom_line(data = results_train, aes(x = Timestamp, y = Predicted, color = "Predicted Train")) +  # Plot predicted values for training data
  geom_line(data = results_test, aes(x = Timestamp, y = Actual, color = "Actual Test")) +  # Plot actual values for test data
  geom_line(data = results_test, aes(x = Timestamp, y = Predicted, color = "Predicted Test")) +  # Plot predicted values for test data
  labs(title = "Actual vs Predicted Weighted Price",
       x = "Timestamp",
       y = "Weighted Price (USD)") +
  scale_color_manual(values = c("Actual Train" = "blue", "Predicted Train" = "red", "Actual Test" = "green", "Predicted Test" = "purple")) +  # Define colors for the lines
  theme_minimal()  # Apply a minimal theme

XGBOOST MODEL EXPLANATIONS

IN R

The XGBoost model’s output metrics indicate its performance on the training data.

Training MSE (46888.94): Mean Squared Error (MSE) measures the average of the squared differences between the predicted and actual values. It is more sensitive to large errors due to the squaring operation, providing a higher penalty for larger errors. An MSE of 46888.94 corresponds to a root-mean-squared error of about 216.54, which matches the model's reported best_score.

Training MAE (54.29731): Mean Absolute Error (MAE) calculates the average of the absolute differences between predicted and actual values, giving equal weight to all errors. An MAE of 54.29731 signifies an average absolute prediction error of roughly 54 USD.

Both metrics are essential for evaluating model performance, with MSE being more influenced by larger errors and MAE providing a straightforward average error magnitude. Lower values for both metrics indicate a better fit of the model to the training data.
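A tiny R sketch (with invented numbers, not taken from this model) makes the difference concrete: two forecasts with the same total absolute error can have very different MSEs, because MSE concentrates the penalty on the single large miss.

```r
# Toy example (hypothetical numbers): two prediction vectors with the
# same total absolute error, one of which contains a single large miss
actual     <- c(100, 100, 100, 100)
pred_even  <- c(110, 110, 110, 110)  # four misses of 10
pred_spike <- c(100, 100, 100, 140)  # one miss of 40

mae <- function(a, p) mean(abs(a - p))
mse <- function(a, p) mean((a - p)^2)

mae(actual, pred_even)   # 10
mae(actual, pred_spike)  # 10  -> MAE treats both forecasts the same
mse(actual, pred_even)   # 100
mse(actual, pred_spike)  # 400 -> MSE penalizes the single large error heavily
```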

IN PYTHON COMPARISON

The significant difference in Mean Squared Error (MSE) and Mean Absolute Error (MAE) between the XGBoost models run in R and Python suggests that one of the implementations might have issues or differences in data processing or model setup. Here’s an analysis to help decide which language might be better based on these metrics:

Metrics:

  • R Metrics:
    • MSE: 46888.94
    • MAE: 54.29731
  • Python Metrics:
    • MSE: 484476.69
    • MAE: 474.81

Interpretation:

  • MSE (Mean Squared Error) measures the average squared difference between the actual and predicted values. A lower MSE indicates better model performance.
  • MAE (Mean Absolute Error) measures the average absolute difference between actual and predicted values. Like MSE, a lower MAE indicates better performance.

Observations:

  • The model in R has significantly lower MSE and MAE values compared to the Python model.
  • This suggests that the R model is performing much better, predicting closer to the actual values than the Python model.

Possible Reasons for Differences:

  1. Data Preprocessing: There might be differences in how missing values, scaling, or data types are handled in R versus Python.

Conclusion:

Based on the provided metrics, the R implementation appears to perform significantly better than the Python implementation. Therefore, for this specific task and data:

  • R would be the better choice given the lower error metrics, indicating a better fit to the data.

PROPHET FORECASTING

library(dplyr)  # For data manipulation
library(zoo)  # For the na.locf function (Last Observation Carried Forward)
library(lubridate)  # For handling date/time data

# Assuming 'data' is your data frame in R with 'Timestamp' and 'Weighted_Price'
data <- data %>%
  mutate(Timestamp = as.POSIXct(Timestamp, origin = "1970-01-01", tz = "UTC"))

# Step 2: Group by hourly frequency and take the first value within each hour
data <- data %>%
  group_by(Timestamp = floor_date(Timestamp, unit = "hour")) %>%  # Group data by hour (round down to the nearest hour)
  summarise(Weighted_Price = first(Weighted_Price)) %>%  # Summarize by taking the first 'Weighted_Price' within each hour
  ungroup()  # Ungroup the data

# Step 3: Set 'Timestamp' as the index and handle missing values
data <- data %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%  # Ensure 'Timestamp' is in POSIXct format
  arrange(Timestamp) %>%  # Arrange the data by 'Timestamp' in ascending order
  select(Timestamp, Weighted_Price) %>%  # Select only the 'Timestamp' and 'Weighted_Price' columns
  mutate(Weighted_Price = na.locf(Weighted_Price))  # Fill missing values by carrying the last observation forward

# Print the resulting data frame
print(data)  # Output the cleaned and processed data frame
## # A tibble: 1,408 × 2
##    Timestamp           Weighted_Price
##    <dttm>                       <dbl>
##  1 2014-12-01 00:00:00           300 
##  2 2014-12-02 00:00:00             0 
##  3 2014-12-03 00:00:00             0 
##  4 2014-12-04 00:00:00             0 
##  5 2014-12-06 00:00:00           378 
##  6 2014-12-08 00:00:00           375.
##  7 2014-12-10 00:00:00           398 
##  8 2014-12-12 00:00:00           379 
##  9 2014-12-18 00:00:00           342.
## 10 2015-01-08 00:00:00           360 
## # ℹ 1,398 more rows
library(ggplot2)

# Define color palette
color_pal <- c("#F8766D", "#D39200", "#93AA00", "#00BA38", "#00C19F", "#00B9E3", "#619CFF", "#DB72FB")



# Plot using ggplot2
ggplot(data, aes(x = Timestamp, y = Weighted_Price)) +
  geom_line(color = color_pal[1]) +  # Adjust color_pal index as needed
  labs(title = "BTC Weighted Price (USD) by Hour") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # Rotate x-axis labels for better readability
  scale_x_datetime(date_breaks = "3 months", date_labels = "%Y-%m")  # Hourly breaks over a multi-year span would be unreadable; use quarterly breaks

# Assuming 'data' is your dataset with 'Timestamp' as a column

# Ensure 'Timestamp' is in POSIXct (datetime) format; after the steps above it
# already is, so this is a no-op safeguard (no 'format' argument is needed for POSIXct input)
data$Timestamp <- as.POSIXct(data$Timestamp)

# Define split_date in POSIXct format
split_date <- as.POSIXct("2018-06-25")

# Create data_train and data_test
data_train <- data[data$Timestamp <= split_date, ]
data_test <- data[data$Timestamp > split_date, ]
library(dplyr)
library(ggplot2)
library(lubridate)

# Assuming 'data' is your dataframe with 'Timestamp' (POSIXct) and 'Weighted_Price'



# Group and summarize by hour, ensuring no missing values
data_hourly <- data %>%
  group_by(Timestamp = floor_date(Timestamp, "hour")) %>%
  summarize(Weighted_Price = mean(Weighted_Price, na.rm = TRUE)) %>%
  ungroup()

# Plotting with ggplot2
ggplot(data_hourly, aes(x = Timestamp, y = Weighted_Price)) +
  geom_line(color = "#F8766D") +
  labs(title = "BTC Weighted Price (USD) by Hour", x = "Timestamp", y = "Price") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_datetime(date_labels = "%Y-%m", date_breaks = "3 months")  # Hourly breaks would overcrowd a multi-year axis; scale_color_manual is unnecessary since no colour aesthetic is mapped

MODEL

# Load required libraries
library(prophet)
## Loading required package: Rcpp
## Loading required package: rlang
## 
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
## 
##     %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
##     flatten_raw, invoke, splice
library(dplyr)

# Assuming 'data_train' is your dataframe with columns 'Timestamp' and 'Weighted_Price'

# Convert data_train to proper format and rename columns
data_train <- data_train %>%
  mutate(ds = Timestamp, y = Weighted_Price) %>%
  select(ds, y)

# Setup and train model
model <- prophet()  # Create a new Prophet model object
model_fit <- fit.prophet(model, data_train)  # Fit the model to the training data
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
# Optionally, make future predictions
future <- data.frame(ds = seq(min(data_train$ds), max(data_train$ds) + days(365), by = "day"))
future$ds <- as.POSIXct(future$ds)
summary(model_fit$extra_regressors)
## Length  Class   Mode 
##      0   list   list
summary(model)
##                         Length Class  Mode     
## growth                  1      -none- character
## changepoints            0      -none- NULL     
## n.changepoints          1      -none- numeric  
## changepoint.range       1      -none- numeric  
## yearly.seasonality      1      -none- character
## weekly.seasonality      1      -none- character
## daily.seasonality       1      -none- character
## holidays                0      -none- NULL     
## seasonality.mode        1      -none- character
## seasonality.prior.scale 1      -none- numeric  
## changepoint.prior.scale 1      -none- numeric  
## holidays.prior.scale    1      -none- numeric  
## mcmc.samples            1      -none- numeric  
## interval.width          1      -none- numeric  
## uncertainty.samples     1      -none- numeric  
## specified.changepoints  1      -none- logical  
## start                   0      -none- NULL     
## y.scale                 0      -none- NULL     
## logistic.floor          1      -none- logical  
## t.scale                 0      -none- NULL     
## changepoints.t          0      -none- NULL     
## seasonalities           0      -none- list     
## extra_regressors        0      -none- list     
## country_holidays        0      -none- NULL     
## stan.fit                0      -none- NULL     
## params                  0      -none- list     
## history                 0      -none- NULL     
## history.dates           0      -none- NULL     
## train.holiday.names     0      -none- NULL     
## train.component.cols    0      -none- NULL     
## component.modes         0      -none- NULL     
## fit.kwargs              0      -none- list
head(predict(model_fit))
## # A tibble: 6 × 19
##   ds                   trend additive_terms additive_terms_lower
##   <dttm>               <dbl>          <dbl>                <dbl>
## 1 2014-12-01 00:00:00 -2717.          2089.                2089.
## 2 2014-12-02 00:00:00 -2694.          2241.                2241.
## 3 2014-12-03 00:00:00 -2671.          2408.                2408.
## 4 2014-12-04 00:00:00 -2648.          2568.                2568.
## 5 2014-12-06 00:00:00 -2602.          2917.                2917.
## 6 2014-12-08 00:00:00 -2555.          3259.                3259.
## # ℹ 15 more variables: additive_terms_upper <dbl>, weekly <dbl>,
## #   weekly_lower <dbl>, weekly_upper <dbl>, yearly <dbl>, yearly_lower <dbl>,
## #   yearly_upper <dbl>, multiplicative_terms <dbl>,
## #   multiplicative_terms_lower <dbl>, multiplicative_terms_upper <dbl>,
## #   yhat_lower <dbl>, yhat_upper <dbl>, trend_lower <dbl>, trend_upper <dbl>,
## #   yhat <dbl>
# Load required libraries
library(prophet)
library(dplyr)

# Setup and train model
model <- prophet()  # Create a new Prophet model object
model_fit <- fit.prophet(model, data_train)  # Fit the model to the training data
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
# Optionally, make future predictions
future <- make_future_dataframe(model_fit, periods = 365)  # Forecasting for the next 365 days
forecast <- predict(model_fit, future)

# Plotting predicted vs actual and future forecasted values
plot(model_fit, forecast)

Prophet RMSE and MAE

library(prophet)
library(dplyr)
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following object is masked from 'package:rlang':
## 
##     ll
## The following object is masked from 'package:forecast':
## 
##     accuracy
# Optionally, make future predictions
future <- make_future_dataframe(model_fit, periods = 365)  # Create a future dataframe for predictions
forecast <- predict(model_fit, future)  # Make predictions

# Merge the forecast with the actual data for evaluation
actual_vs_predicted <- data_train %>%
  left_join(forecast, by = "ds") %>%
  select(ds, y, yhat) %>%
  na.omit()  # Remove rows with NA values

# Calculate MAE and RMSE
mae_value <- mae(actual_vs_predicted$y, actual_vs_predicted$yhat)
rmse_value <- rmse(actual_vs_predicted$y, actual_vs_predicted$yhat)

# Print model summary and performance metrics
summary(model_fit)
##                         Length Class      Mode     
## growth                     1   -none-     character
## changepoints              25   POSIXct    numeric  
## n.changepoints             1   -none-     numeric  
## changepoint.range          1   -none-     numeric  
## yearly.seasonality         1   -none-     character
## weekly.seasonality         1   -none-     character
## daily.seasonality          1   -none-     character
## holidays                   0   -none-     NULL     
## seasonality.mode           1   -none-     character
## seasonality.prior.scale    1   -none-     numeric  
## changepoint.prior.scale    1   -none-     numeric  
## holidays.prior.scale       1   -none-     numeric  
## mcmc.samples               1   -none-     numeric  
## interval.width             1   -none-     numeric  
## uncertainty.samples        1   -none-     numeric  
## specified.changepoints     1   -none-     logical  
## start                      1   POSIXct    numeric  
## y.scale                    1   -none-     numeric  
## logistic.floor             1   -none-     logical  
## t.scale                    1   -none-     numeric  
## changepoints.t            25   -none-     numeric  
## seasonalities              2   -none-     list     
## extra_regressors           0   -none-     list     
## country_holidays           0   -none-     NULL     
## stan.fit                   4   -none-     list     
## params                     6   -none-     list     
## history                    5   tbl_df     list     
## history.dates           1268   POSIXct    numeric  
## train.holiday.names        0   -none-     NULL     
## train.component.cols       4   data.frame list     
## component.modes            2   -none-     list     
## fit.kwargs                 0   -none-     list
cat("MAE:", mae_value, "\n")
## MAE: 625.8949
cat("RMSE:", rmse_value, "\n")
## RMSE: 1025.923

MAE and RMSE explanations

The Prophet model’s MAE (625.8949) indicates an average prediction error of about 625.89 units, showing how much predictions deviate from actual values on average. The RMSE (1025.923) highlights a larger deviation, emphasizing that some predictions have significant errors. Both metrics help assess the model’s accuracy.
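The gap between the two metrics is itself informative: RMSE is always at least as large as MAE, and a ratio well above 1 (here roughly 1025.9 / 625.9 ≈ 1.64) signals that the errors are unevenly distributed, with some large misses. A small R check with hypothetical residuals:

```r
# Hypothetical residuals illustrating why RMSE >= MAE:
# mostly small errors plus one large one
errors <- c(-5, 2, 0, -1, 30)

mae_val  <- mean(abs(errors))     # 7.6
rmse_val <- sqrt(mean(errors^2))  # ~13.6

rmse_val / mae_val  # a ratio well above 1 indicates a heavy-tailed error distribution
```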

Comparison

The model trained with Prophet in R has an MAE of 625.8949 and an RMSE of 1025.923, while the Prophet model in Python has an MAE of 1678.49 and an MSE of 3654149.48 (an RMSE of roughly 1911.6). Since both MAE and RMSE are significantly lower for the R implementation, it achieves better prediction accuracy with fewer large errors. Thus, the Prophet model in R is the better choice for this dataset.

ARIMA

Daily data

# Filter data_train to start from '2014-12-01' and convert to time series format (ts)
ts_data <- data %>%
  filter(Timestamp >= as.POSIXct("2014-12-01")) %>%
  arrange(Timestamp) %>%
  pull(Weighted_Price) %>%
  ts(start = c(year(as.Date("2014-12-01")), month(as.Date("2014-12-01"))), frequency = 365)

plot(decompose(ts_data))

# Decompose the time series to extract its components
decomp <- decompose(ts_data)

# Extract the seasonal, trend, and random (or residual) components
seasonal <- decomp$seasonal
trend <- decomp$trend
random <- decomp$random

# Remove seasonal and trend components from the original time series data
detrended <- ts_data - seasonal - trend
autoplot(detrended) + ggtitle("ts without seasonality and trend")  # autoplot() has no 'title' argument; add the title as a ggplot layer

LJUNG-BOX

The Box-Ljung test indicates significant autocorrelation in the time series data, rejecting the null hypothesis of no autocorrelation.

# Perform Ljung-Box test for autocorrelation
ljung_box <- Box.test(ts_data, lag = 20, type = "Ljung-Box")
print(ljung_box)
## 
##  Box-Ljung test
## 
## data:  ts_data
## X-squared = 26393, df = 20, p-value < 2.2e-16

PACF

Interpreting the PACF Plot

  1. Y-Axis (Partial ACF):
    • The y-axis represents the partial autocorrelation values. These values measure the correlation between the time series and its lagged values, with the effects of intermediate lags removed.
  2. X-Axis (Lag):
    • The x-axis represents the lag order. Each point on the plot corresponds to the partial autocorrelation at a specific lag.
  3. Horizontal Blue Lines:
    • The horizontal blue lines are the significance bounds, typically set at approximately ±1.96/√N, where N is the number of observations in the time series. Values outside these bounds are considered statistically significant.
  4. Bars:
    • The vertical bars represent the partial autocorrelation values at different lags. If a bar extends beyond the significance bounds, it indicates a significant partial autocorrelation at that lag.
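
As a sketch of point 3, the bound itself is easy to compute. The series length used below (1441, roughly four years of daily observations) is illustrative, not read from the data:

# ACF/PACF significance bounds sit at approximately +/- 1.96/sqrt(N)
n_obs <- 1441                  # illustrative sample size
bound <- 1.96 / sqrt(n_obs)
bound                          # about 0.0516: bars beyond this are significant at the 5% level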

Key Observations from the Plot

  • Significant Lags:
    • The plot shows that the first lag has a significant positive partial autocorrelation, as indicated by the bar extending beyond the upper significance bound.
    • Subsequent lags are within the significance bounds, suggesting that they do not have significant partial autocorrelation values.

Implications

  • The significant partial autocorrelation at the first lag suggests that, after differencing, the time series retains a significant relationship with its immediate past value.
  • This implies that a simple autoregressive model of order 1 (AR(1)) may be sufficient to model the differenced series. Differencing has likely removed any higher-order correlations that were present in the original non-stationary series.
  • If the original series was integrated of order 1 (i.e., needed one round of differencing to become stationary), this PACF plot of the differenced series indicates that it is now a stationary AR(1) process.
# Difference the time series data to remove the trend
diff_ts_data <- diff(ts_data)

# Plot the differenced time series data
plot(diff_ts_data, main = "Differenced Time Series")

# PACF (Partial Autocorrelation Function) plot of differenced data
pacf_plot <- pacf(diff_ts_data, lag.max = 5)

plot(pacf_plot, main = "Partial Autocorrelation Function (PACF) of Differenced Time Series")

ACF

# Difference the time series data (same transformation as above)
diff_ts_data <- diff(ts_data)

# Plot the differenced time series data
plot(diff_ts_data, main = "Differenced Time Series")

# ACF (Autocorrelation Function) plot of differenced data
acf_plot <- acf(diff_ts_data, lag.max = 5)

plot(acf_plot, main = "Autocorrelation Function (ACF) of Differenced Time Series")

Stationarity test

The Augmented Dickey-Fuller test statistic is -2.8671 with a p-value of 0.2113, indicating insufficient evidence to reject the null hypothesis of non-stationarity in the time series data.

# Augmented Dickey-Fuller Test for stationarity
adf_test <- adf.test(ts_data)
print(adf_test)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_data
## Dickey-Fuller = -2.8671, Lag order = 11, p-value = 0.2113
## alternative hypothesis: stationary

ARIMA MODEL

The ARIMA(0,1,0) model for ‘ts_data’ uses first-order differencing with no AR or MA terms (a random walk). It has an RMSE of 295.0036 and an MAE of 125.1681, indicating moderate predictive accuracy and low residual autocorrelation (ACF1 = -0.0183).

# Fit ARIMA model
arima_model <- auto.arima(ts_data)
summary(arima_model)
## Series: ts_data 
## ARIMA(0,1,0) 
## 
## sigma^2 = 87089:  log likelihood = -9998.54
## AIC=19999.08   AICc=19999.08   BIC=20004.32
## 
## Training set error measures:
##                    ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 4.295254 295.0036 125.1681 -Inf  Inf 0.0421141 -0.01833056
# Forecast future values
forecast_results <- forecast(arima_model, h = 365)  # Forecasting for 365 days into the future

# Extract mean forecast and prediction intervals
forecast_df <- data.frame(date = time(forecast_results$mean),
                          mean = forecast_results$mean)

# Check if lower and upper prediction intervals are available
# (forecast objects are lists, so test the components directly)
if (!is.null(forecast_results$lower) && !is.null(forecast_results$upper)) {
  forecast_df <- cbind(forecast_df, lower = forecast_results$lower, upper = forecast_results$upper)
}

autoplot(forecast_results)


monthly data

# Assuming 'data' is your dataframe with columns 'Timestamp' and 'Weighted_Price'

# Convert 'Timestamp' to Date format (if it's POSIXct)
data$Timestamp <- as.Date(data$Timestamp)

# Aggregate daily data to monthly averages (or sums if appropriate)
monthly_data <- data %>%
  group_by(year = year(Timestamp), month = month(Timestamp)) %>%
  summarise(Weighted_Price = mean(Weighted_Price), .groups = "drop")
# Create time series object (ts) from monthly aggregated data
ts_data_month <- ts(monthly_data$Weighted_Price, start = c(min(monthly_data$year), min(monthly_data$month)), frequency = 12)

# Print the monthly time series data
print(ts_data_month)
##              Jan         Feb         Mar         Apr         May         Jun
## 2014   241.38894    94.01395   209.31075   270.45916   212.99045   199.30096
## 2015   422.78910   387.11542   373.07645   375.02507   420.81790   460.89656
## 2016   818.30801   887.86747   986.86846  1096.89919  1211.67480  1824.05567
## 2017 15145.08148 13049.36142  9372.40978  9094.04615  7944.96273  8478.67284
##              Jul         Aug         Sep         Oct         Nov         Dec
## 2014   221.30130   279.67248   244.38123   226.26611   247.29074   347.49424
## 2015   639.82650   643.72686   581.96052   604.72387   639.62294   723.87999
## 2016  2612.99223  2497.87300  3813.00553  4122.07821  5283.43203  7674.87378
## 2017  6806.10031  7079.32300  6701.74776  6599.50915  6423.81010  6381.69332

plot and summary

# Example: Plotting the time series
plot(ts_data_month, main = "Monthly Time Series Plot")

# Example: Computing statistics
summary(ts_data_month)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    94.01   366.68   771.09  3020.29  6392.22 15145.08

To perform decomposition, ACF, PACF, ADF test, stationarity test, ARIMA modeling, forecasting, and plotting for your monthly time series data ts_data_month, here’s a structured approach in R:

1. Decomposition

  1. Observed:
    • The top panel shows the original time series data. This is the combination of the trend, seasonal, and random components.
    • The series starts around 2014 and ends in 2018, with an apparent overall increase over time, particularly a significant spike around 2017.
  2. Trend:
    • The second panel displays the trend component, which represents the long-term progression of the time series.
    • This component captures the underlying direction in the data. The trend here shows a steady increase, particularly from around 2016 onward, indicating a general upward movement in the data over the observed period.
  3. Seasonal:
    • The third panel shows the seasonal component, capturing the repeating short-term cycle in the data.
    • This component highlights the periodic fluctuations that repeat over a specific period (e.g., yearly, quarterly). In this series, there is a clear seasonal pattern with peaks and troughs that seem to repeat annually.
  4. Random (Residual):
    • The bottom panel displays the residual component, which represents the irregular or random fluctuations in the time series that are not explained by the trend or seasonal components.
    • This component includes noise and any other patterns not captured by the trend or seasonal components. There appears to be a significant spike in the residuals around the same time as the spike in the observed data in 2017, indicating some unusual or unexpected event.

Interpretation

  • Trend: The increasing trend suggests a long-term growth in the observed phenomenon. This could be due to various factors such as economic growth, technological advancements, or other underlying factors driving the increase.
  • Seasonal: The repeating pattern indicates a consistent seasonal effect, which could be due to factors such as weather, holidays, or other cyclical events.
  • Random: The residuals show the randomness or noise in the data. Significant spikes in this component can indicate outliers or anomalies in the data that may warrant further investigation.
# Perform decomposition
decomp <- decompose(ts_data_month)

# Plot decomposition
plot(decomp)

2. ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function)

The PACF plot suggests that an autoregressive model of order 1 (AR(1)) might be appropriate for this time series, since only the first lag shows a significant partial autocorrelation. Had higher-order lags been significant as well, a higher-order AR model would be worth considering.

# ACF plot
acf_plot <- acf(ts_data_month, lag.max = 5)

plot(acf_plot, main = "Autocorrelation Function (ACF)")

# PACF plot
pacf_plot <- pacf(ts_data_month, lag.max = 5)

plot(pacf_plot, main = "Partial Autocorrelation Function (PACF)")
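
Following that suggestion, an explicit AR(1) fit could be compared against the automatic selection. A sketch using the forecast package (`Arima` and `checkresiduals` are its exported functions):

library(forecast)

# Fit an AR(1) model on the monthly series for comparison with auto.arima()
ar1_model <- Arima(ts_data_month, order = c(1, 0, 0))
summary(ar1_model)

# Residual diagnostics, including a Ljung-Box test on the residuals
checkresiduals(ar1_model)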

3. Augmented Dickey-Fuller (ADF) Test for Stationarity

Here the test statistic (-1.4163) is well above the 5% critical value for tau2 (-2.93), so the null hypothesis of a unit root cannot be rejected: the monthly series is non-stationary in levels.

library(forecast)
library(urca)
library(TSA)
## Registered S3 methods overwritten by 'TSA':
##   method       from    
##   fitted.Arima forecast
##   plot.Arima   forecast
## 
## Attaching package: 'TSA'
## The following object is masked from 'package:readr':
## 
##     spec
## The following objects are masked from 'package:stats':
## 
##     acf, arima
## The following object is masked from 'package:utils':
## 
##     tar
# ADF test
adf_test <- ur.df(ts_data_month, lags = 1, type = "drift")  # Adjust lags as needed
summary(adf_test)
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression drift 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2629.2  -324.4  -246.0    17.9  7286.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 345.89029  260.43247   1.328    0.191
## z.lag.1      -0.07766    0.05484  -1.416    0.164
## z.diff.lag    0.18143    0.15035   1.207    0.234
## 
## Residual standard error: 1382 on 43 degrees of freedom
## Multiple R-squared:  0.06343,    Adjusted R-squared:  0.01986 
## F-statistic: 1.456 on 2 and 43 DF,  p-value: 0.2444
## 
## 
## Value of test-statistic is: -1.4163 1.1686 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau2 -3.58 -2.93 -2.60
## phi1  7.06  4.86  3.94

4. Stationarity Test (KPSS)

The KPSS statistic (0.92301) exceeds even the 1% critical value, with p < 0.01, so the null hypothesis of level stationarity is rejected. Both tests therefore agree that the monthly series requires differencing before ARIMA modeling.

# KPSS test (alternative to ADF test)
kpss_test <- kpss.test(ts_data_month)
## Warning in kpss.test(ts_data_month): p-value smaller than printed p-value
print(kpss_test)
## 
##  KPSS Test for Level Stationarity
## 
## data:  ts_data_month
## KPSS Level = 0.92301, Truncation lag parameter = 3, p-value = 0.01

5. ARIMA Modeling

# Fit ARIMA model
arima_model <- auto.arima(ts_data_month)
summary(arima_model)
## Series: ts_data_month 
## ARIMA(0,1,0) 
## 
## sigma^2 = 1883460:  log likelihood = -406.23
## AIC=814.47   AICc=814.55   BIC=816.32
## 
## Training set error measures:
##                   ME     RMSE      MAE      MPE     MAPE      MASE      ACF1
## Training set 127.928 1358.021 547.4638 3.152882 17.70394 0.1915899 0.1408026

6. Forecasting

# Forecast using ARIMA model
forecast_values <- forecast(arima_model, h = 12)  # Forecasting 12 months ahead
print(forecast_values)
##          Point Forecast     Lo 80     Hi 80      Lo 95     Hi 95
## Jan 2018       6381.693 4622.9020  8140.485  3691.8542  9071.532
## Feb 2018       6381.693 3894.3868  8869.000  2577.6863 10185.700
## Mar 2018       6381.693 3335.3775  9428.009  1722.7553 11040.631
## Apr 2018       6381.693 2864.1108  9899.276  1002.0150 11761.372
## May 2018       6381.693 2448.9165 10314.470   367.0301 12396.356
## Jun 2018       6381.693 2073.5521 10689.835  -207.0401 12970.427
## Jul 2018       6381.693 1728.3690 11035.018  -734.9521 13498.339
## Aug 2018       6381.693 1407.0804 11356.306 -1226.3207 13989.707
## Sep 2018       6381.693 1105.3195 11658.067 -1687.8241 14451.211
## Oct 2018       6381.693  819.9069 11943.480 -2124.3249 14887.712
## Nov 2018       6381.693  548.4426 12214.944 -2539.4939 15302.881
## Dec 2018       6381.693  289.0616 12474.325 -2936.1828 15699.569
# Plot forecasted values
plot(forecast_values, main = "Forecasted Values")
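
As a possible sanity check on these forecasts (a sketch, not part of the analysis above), the last year of monthly data could be held out and the refit model scored with `forecast::accuracy()`. The window boundaries below are illustrative:

# Hold out 2017 onward, refit, and score the out-of-sample forecast
train <- window(ts_data_month, end = c(2016, 12))
test  <- window(ts_data_month, start = c(2017, 1))

fit <- auto.arima(train)
fc  <- forecast(fit, h = length(test))

# accuracy() reports ME, RMSE, MAE, etc. for both training and test sets
accuracy(fc, test)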